We have already provided some rules to follow as we created plots for our examples. Here we aim to provide some general principles we can use as a guide for effective data visualization. Much of this section is based on a talk by Karl Broman titled “Creating effective figures and tables” including some of the figures which were made with code that Karl makes available on his GitHub repository, and class notes from Peter Aldhous’ Introduction to Data Visualization course.
Following Karl’s approach, we show some examples of plot styles we should avoid, explain how to improve them, and use these as motivation for a list of principles.We compare and contrast plots that follow these principles to those that don’t.
The principles are mostly based on research related to how humans detect patterns and make visual comparisons. The preferred approaches are those that best fit the way our brains process visual information. When deciding on a visualization approach it is also important to keep our goal in mind. We may be comparing a viewable number of quantities, describing a distribution for categories or numeric values, comparing the data from two groups, or describing the relationship between two variables.
As a final note, we also note that for a data scientist it is important to adapt and optimize graphs to the audience. For example, an exploratory plot made for ourselves will be different than a chart intended to communicate a finding to a general audience.
We will be using these libraries:
library(tidyverse)
library(gridExtra)
library(dslabs)
ds_theme_set()
We start by describing some principles for encoding data. There are several approaches at our disposal including position, aligned lengths, angles, area, brightness, and color hue.
To illustrate how some of these strategies compare let’s suppose we want to report the results from two hypothetical polls regarding browser preference taken in 2000 and then 2015. Here, for each year, we are simply comparing five quantities, five percentages.
A widely used graphical representation of percentages, popularized by Microsoft Excel, is the pie chart:
Here we are representing quantities with both areas and angles since both the angle and area of each pie slice is proportional to the quantity it represents. This turns out to be a suboptimal choice since, as demonstrated by perception studies, humans are not good at precisely quantifying angles and are even worse when only area is available.
The donut chart is an example of a plot that uses only area:
To see how hard it is to quantify angles, note that the rankings and all the percentages in the plots above changed from 2000 to 2015. Can you determine the actual percentages and rank the browsers’ popularity? Can you see how the percentages changed from 2000 to 2015? It is not easy to tell from the plot.
In fact, the pie R function help file states:
“Note: Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.”
In this case, simply showing the numbers is not only clearer, but it would save on print cost if making a paper version.
| Browser | 2000 | 2015 |
|---|---|---|
| Opera | 3 | 2 |
| Safari | 21 | 22 |
| Firefox | 23 | 21 |
| Chrome | 26 | 29 |
| IE | 28 | 27 |
The preferred way to plot quantities is to use length and position since humans are much better at judging linear measure. The bar plot uses bars of length proportional to the quantities of interest. By adding horizontal lines at strategically chosen values, in this case at every multiple of 10, we ease the quantifying through the position of the top of the bars.
p2 <-browsers %>%
ggplot(aes(Browser, Percentage)) +
geom_bar(stat = "identity", width=0.5, fill=4, col = 1) +
ylab("Percent using the Browser") +
facet_grid(.~Year)
grid.arrange(p1, p2, nrow = 2)
Notice how much easier it is to see the differences in the barplot. In fact, we can now determine the actual percentages by following a horizontal line to the x-axis.
If for some reason you need to make a pie chart, do include the percentages as numbers to avoid having to infer them from the angles or area:
In general, position and length are the preferred ways to display quantities over angles which are preferred to area.
Brightness and color are even harder to quantify than angles and area but, as we will see later, they are sometimes useful when more than two dimensions are being displayed.
When using barplots it is dishonest not to start the bars at 0. This is because, by using a barplot, we are implying the length is proportional to the quantities being displayed. By avoiding 0, relatively small differences can be made to look much bigger than they actually are. This approach is often used by politicians or media organizations trying to exaggerate a difference.
Here is an illustrative example:
(Source: Fox News, via Peter Aldhous via Media Matters via Fox News) via Media Matters.
From the plot above, it appears that apprehensions have almost tripled when in fact they have only increased by about 16%. Starting the graph at 0 illustrates this clearly:
Here is another example, described in detail here,
which makes a 4.6% increase look like a five fold change.
Here is the appropriate plot:
When using position rather than length, it is not necessary to include 0. This is particularly the case when we want to compare differences between groups relative to the variability seen within the groups.
Here is an illustrative example showing country average life expectancy stratified by continent in 2012:
The space between 0 and 43 in the plot on the left adds no information and makes it harder to appreciate the between and within variability. Here, 0 should not be included.
During President Barack Obama’s 2011 State of the Union Address the following chart was used to compare the US GDP to the GDP of four competing nations:
Note judging by the area of the circles the US appears to have an economy over five times larger than China and over 30 times larger than France. However, when looking at the actual numbers one sees that this is not the case. The actual ratios are 2.6 and 5.8 times bigger than China and France respectively. The reason for this distortion is that the radius, rather than the area, was made to be proportional to the quantity which implies that the proportion between the areas is squared: 2.6 turns into 6.5 and 5.8 turns into 34.1. Proportional to the radius compared to proportional to area:
Not surprisingly, ggplot defaults to using area rather than radius. Of course, in this case, we really should not be using area at all since we can use position and length:
When one of the axes is used to show categories, as is done in barplots, the default ggplot behavior is to order the categories alphabetically when they are defined by character strings. If they are defined by factors, they are ordered by the factor levels. We rarely want to use alphabetical order. Instead we should order by a meaningful quantity.
In all the cases above, the barplots where ordered by the values
being displayed. The exception was the graph showing barplots comparing
browsers. In this case we kept the order the same across the barplots to
ease the comparison. We ordered by the average value of 2000 and 2015.
We previously learned how to use the reorder function,
which helps achieve this goal.
To appreciate how the right order can help convey a message, suppose we want to create a plot to compare the murder rate across states. We are particularly interested in the most dangerous and safest states. Note the difference when we order alphabetically (the default) versus when we order by the actual rate:
data(murders)
p1 <- murders %>% mutate(murder_rate = total / population * 100000) %>%
ggplot(aes(state, murder_rate)) +
geom_bar(stat="identity") +
coord_flip() +
xlab("")
p2 <- murders %>% mutate(murder_rate = total / population * 100000) %>%
mutate(state = reorder(state, murder_rate)) %>%
ggplot(aes(state, murder_rate)) +
geom_bar(stat="identity") +
coord_flip() +
xlab("")
grid.arrange(p1, p2, ncol = 2)
Note that the reorder function lets us reorder groups as
well.
Below is an example we saw earlier with and without reorder. The first orders the regions alphabetically while the second orders them by the group’s median.
We have focused on displaying single quantities across categories. We now shift our attention to displaying data, with a focus on comparing groups.
To motivate our first principle, we’ll use our heights data. A commonly seen plot used for comparisons between groups, popularized by software such as Microsoft Excel, shows the average and standard errors (standard errors are defined in a later lecture, but don’t confuse them with the standard deviation of the data).
The plot looks like this:
The average of each group is represented by the top of each bar and the antennae expand to the average plus two standard errors. If all someone receives is this plot they will have little information on what to expect if they meet a group of human males and females. The bars go to 0, does this mean there are tiny humans measuring less than one foot? Are all males taller than the tallest females? Is there a range of heights? Someone can’t answer these questions since we have provided almost no information on the height distribution.
This brings us to our first principle: show the data. This simple ggplot code already generates a more informative plot than the barplot by simply showing all the data points:
heights %>% ggplot(aes(sex, height)) + geom_point()
For example, we get an idea of the range of the data. However, this plot has limitations as well since we can’t really see all the 238 and 812 points plotted for females and males respectively, and many points are plotted on top of each other. As we have described, visualizing the distribution is much more informative. But before doing this, we point out two ways we can improve a plot showing all the points.
The first is to add jitter: adding a small random shift to each point. In this case, adding horizontal jitter does not alter the interpretation, since the height of the points do not change, but we minimize the number of points that fall on top of each other and therefore get a better sense of how the data is distributed.
A second improvement comes from using alpha blending: making the points somewhat transparent. The more points fall on top of each other, the darker the plot which also helps us get a sense of how the points are distributed.
Here is the same plot with jitter and alpha blending:
heights %>% ggplot(aes(sex, height)) +
geom_jitter(width = 0.1, alpha = 0.2)
Now we start getting a sense that, on average, males are taller than females. We also note dark horizontal lines demonstrating that many reported values are rounded to the nearest integer. Since there are so many points it is more effective to show distributions, rather than show individual points. In our next example we show the improvements provided by distributions and suggest further principles.
Earlier we saw this plot used to compare male and female heights:
Since there are so many points it is more effective to show distributions, rather than show individual points. We therefore show histograms for each group:
However, from this plot it is not immediately obvious that males are, on average, taller than females. We have to look carefully to notice that the x-axis has a higher range of values in the male histogram. An important principle here is to keep the axes the same when comparing data across to plots.
Note how the comparison becomes easier:
Align plots vertically to see horizontal changes and horizontally to see vertical changes. In these histograms, the visual cue related to decreases or increases in height are shifts to the left or right respectively: horizontal changes. Aligning the plots vertically helps us see this change when the axis are fixed:
This plot makes it much easier to notice that men are, on average, taller. If instead of histograms we want the more compact summary provided by boxplots, then we align them horizontally, since, by default, boxplots move up and down with changes in height.
Following our show the data principle we overlay all the data points:
Now contrast and compare these three plots, based on exactly the same data:
Note how much more we learn from the two plots on the right. Barplots are useful for showing one number, but not very useful when wanting to describe distributions.
We have motivated the use the log transformation in cases where the changes are multiplicative. Population size was an example in which we found a log transformation to yield a more informative transformation. The combination of incorrectly using barplots when a log transformation is merited can be particularly distorting.
As an example, consider this barplot showing the average population sizes for each continent in 2015:
From this plot one would conclude that countries in Asia are much more populous than other continents. Following the show the data principle we quickly notice that this is due to two very large countries, which we assume are India and China:
Here, using a log transformation provides a much more informative plot. We compare the original barplot to a boxplot using the log scale transformation for the y-axis:
Note in particular that with the new plot we realize that countries in Africa actually have a larger median population size than those in Asia.
Other transformations you should consider are the logistic transformation, useful to better see fold changes in odds, and the square root transformation, useful for count data.
When comparing income data between 1970 and 2010 across region we made a figure similar to the one below. A difference is that here we look at continents instead of regions, but this is not relevant to the point we are making.
Note that, for each continent, we want to compare the distributions from 1970 to 2010. The default in ggplot is to order alphabetically so the labels with 1970 come before the labels with 2010, making the comparisons challenging.
Comparison is even easier when boxplots are next to each other:
Comparison is easier when color is used to denote the two things compared.
About 10% of the population is color blind. Unfortunately, the default colors used in ggplot are not optimal for this group. However, ggplot does make it easy to change the color palette used in plots. Here is an example of how we can use a color blind friendly pallet. There are several resources that help you select colors, for example this one.
color_blind_friendly_cols <- c("#999999", "#E69F00", "#56B4E9",
"#009E73", "#F0E442", "#0072B2",
"#D55E00", "#CC79A7")
p1 <- data.frame(x=1:8, y=1:8, col = as.character(1:8)) %>%
ggplot(aes(x, y, color = col)) + geom_point(size=5)
p1 + scale_color_manual(values=color_blind_friendly_cols)
Use scatter plots to examine the relationship between two continuous variables. In every single instance in which we have examined the relationship between two continuous variables, total murders versus population size, life expectancy versus fertility rates, and child mortality versus income, we have used scatter plots. This is the plot we generally recommend.
One exception where another type of plot may be more informative is
when you are comparing variables of the same type but at different time
points and for a relatively small number of comparisons. For example,
comparing life expectancy between 2010 and 2015. In this case we might
recommend a slope chart. There is not a geometry for
slope chart in ggplot2 but we can construct one using
geom_lines. We need to do some tinkering to add labels.
Here is a comparison for large western countries:
west <- c("Western Europe","Northern Europe","Southern Europe",
"Northern America","Australia and New Zealand")
dat <- gapminder %>%
filter(year%in% c(2010, 2015) & region %in% west &
!is.na(life_expectancy) & population > 10^7)
dat %>%
mutate(location = ifelse(year == 2010, 1, 2),
location = ifelse(year == 2015 &
country%in%c("United Kingdom","Portugal"),
location+0.22, location),
hjust = ifelse(year == 2010, 1, 0)) %>%
mutate(year = as.factor(year)) %>%
ggplot(aes(year, life_expectancy, group = country)) +
geom_line(aes(color = country), show.legend = FALSE) +
geom_text(aes(x = location, label = country, hjust = hjust),
show.legend = FALSE) +
xlab("") + ylab("Life Expectancy")
An advantage of the slope chart is that it permits us to quickly get an idea of changes based on the slope of the lines. Note that we are using angle as the visual cue. But we also have position to determine the exact values. Comparing the improvements is a bit harder with a scatter plot:
Note that in the scatter plot we have followed the principle use common axes since we are comparing these before and after. However, if we have many points the slope charts stop being useful as it becomes hard to see each line.
Since we are interested in the difference, it makes sense to dedicate one of our axes to it. The Bland-Altman plot, also known as the Tukey mean-difference plot and the MA-plot, shows the difference versus the average:
Here we quicky see which countries have improved the most as it is represented by the y-axis. We also get an idea of the overall value from the x-axis.
We previously showed a scatter plot showing the relationship between infant survival and average income. Here is a version of this plot where we encode three variables: OPEC membership, region, and population:
Note that we encode categorical variables with color, hue, and shape.
The shape can be controlled with the shape argument. Below
are the shapes available for use in R. Note that for the last five, the
color goes inside.
The default shape values are a circle and a triangle for OPEC
membership. We can manually customize these by adding the layer
scale_shape_manual(values = c(8, 10)), where 8 and 10 are
the numbers of the desired shapes from the list above.
dat %>%
ggplot(aes(dollars_per_day, 1 - infant_mortality/1000,
col = region, size = population/10^6,
shape = OPEC)) +
scale_x_continuous(trans = "log2", limits=c(0.25, 150)) +
scale_y_continuous(trans = "logit",limit=c(0.875, .9981),
breaks=c(.85,.90,.95,.99,.995,.998)) +
geom_point(alpha = 0.5) +
scale_shape_manual(values = c(8, 10))
For continuous variables we can use color, intensity or size. We now show an example of how we do this with a case study.
Vaccines have helped save millions of lives. In the 19th century, before herd immunization was achieved through vaccination programs, deaths from infectious diseases, like smallpox and polio, were common. However, today, despite all the scientific evidence for their importance, vaccination programs have become somewhat controversial.
library(dslabs)
data(us_contagious_diseases)
str(us_contagious_diseases)
## 'data.frame': 16065 obs. of 6 variables:
## $ disease : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : num 1966 1967 1968 1969 1970 ...
## $ weeks_reporting: num 50 49 52 49 51 51 45 45 45 46 ...
## $ count : num 321 291 314 380 413 378 342 467 244 286 ...
## $ population : num 3345787 3364130 3386068 3412450 3444165 ...
We create a temporary object dat that stores only the
Measles data, includes a per 100,000 rate, orders states by average
value of disease and removes Alaska and Hawaii since they only became
states in the late 1950s.
the_disease <- "Measles"
dat <- us_contagious_diseases %>%
filter(!state %in% c("Hawaii","Alaska") & disease == the_disease) %>%
mutate(rate = (count / weeks_reporting) * 52 / (population / 100000))
Can we show data for all states in one plot? We have three variables to show: year, state and rate.
When choosing colors to quantify a numeric variable we chose between
two options: sequential and diverging. Sequential
colors are suited for data that goes from high to low. High values are
clearly distinguished from low values. Here are some examples offered by
the package RColorBrewer:
library(RColorBrewer)
display.brewer.all(type = "seq")
Diverging colors are used to represent values that diverge from a center. We put equal emphasis on both ends of the data range: higher than the center and lower than the center. An example of when we would use a divergent pattern would be if we were to show height in standard deviations away from the average. Here are some examples of divergent patterns:
library(RColorBrewer)
display.brewer.all(type="div")
In our example we want to use a sequential palette since there is no meaningful center, just low and high rates.
We use the geometry geom_tile to tile the region with
colors representing disease rates. We use a square root transformation
to avoid having the really high counts dominate the plot.
dat %>% ggplot(aes(year, state, fill = rate)) +
geom_tile(color = "grey50") +
scale_x_continuous(expand = c(0,0)) +
scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
geom_vline(xintercept = 1963, col = "blue") +
theme_minimal() + theme(panel.grid = element_blank()) +
ggtitle(the_disease) +
ylab("") +
xlab("")
This plot makes a very striking argument for the contribution of vaccines. However, one limitation of this plot is that it uses color to represent quantity which we earlier explained makes it a bit harder to know exactly how high it is going. Position and length are better cues. If we are willing to lose state information, we can make a version of the plot that shows the values with position.
The figure below, taken from the scientific literature [CITE: DNA Fingerprinting: A Review of the Controversy Kathryn Roeder Statistical Science Vol. 9, No. 2 (May, 1994), pp. 222-247] shows three variables: dose, drug type and survival. Although your screen is flat and two dimensional, the plot tries to imitate three dimensions and assigns a dimension to each variable. The extra dimension is not only unnecessary, it makes what should be an easy plot to decipher nearly impossible to draw conclusions from.
Pseudo 3-D.
Plots of maps can be very powerful, very informative and very aesthetically pleasing visualizations. However, they can also be misleading if not created correctly. Here we will introduce one R package for creating effective and informative maps and show how to correct a misleading set of maps.
There are several R packages available that can be used to create
plots of maps. We will be focusing on one of the packages,
maps, because of how intuitive it is, the data available,
and because it works well with ggplot2. We will see several
of the options available with this package, but you can read more about
the maps package here
and see more examples here.
Other available R packages include usmap,
ggmap, ggspatial, sf,
urbnmapr and rnaturalearth, and we recommend
exploring the examples (like this one) and
documentation for these packages for additional types of map plots.
The maps package contains a lot of outlines of
continents, countries, states, and counties. In order to map these using
ggplot2, we must specify which type of map we want with
map_data(). ggplot2 provides the map_data() as
a function that turns a series of points along an outline (from the
maps package) into a data frame of those points. Below we
will see maps of the world, a few countries, the entire US, US states
and US counties, but there are more options available with this
package.
We can save map data as a data frame to see what data is available
and then map with ggplot. Let’s look at the “world”
option.
# Pull out world map data frame
world_map <- map_data("world")
dim(world_map)
## [1] 99338 6
head(world_map)
## long lat group order region subregion
## 1 -69.89912 12.45200 1 1 Aruba <NA>
## 2 -69.89571 12.42300 1 2 Aruba <NA>
## 3 -69.94219 12.43853 1 3 Aruba <NA>
## 4 -70.00415 12.50049 1 4 Aruba <NA>
## 5 -70.06612 12.54697 1 5 Aruba <NA>
## 6 -70.05088 12.59707 1 6 Aruba <NA>
We can see the latitude (lat) and longitude
(long) for different regions of the world are available.
Here, region corresponds to a country and
subregion indicates additional information about that
country if available. For example, the Virgin Islands fall into
different political jurisdictions, so you’ll see “British” and “US”
under subregion for the Virgin Islands.
The column order shows in which order ggplot should
“connect the dots” when plotting. group is important since
functions in ggplot2 can take a group argument which
controls (amongst other things) whether adjacent points should be
connected by lines. If they are in the same group, then they get
connected, but if they are in different groups then they don’t.
Essentially, having two points in different groups means that ggplot
“lifts the pen” when going between them. As we’ll see below, including
group = group in the aes() argument will plot
the outlines of countries. If we don’t include
group = group, the plot will have many lines connecting
different parts of different countries and will be impossible to
interpret.
For this plot we also use fill (to denote the color to
fill the countries with) and color (the outline color for
each country). We will see more options as we create more complex
maps.
# Basic example of a world map
world_map %>% ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "gray", color = "white")
We can also map one specific country, or a few countries. If plotting
one country, the second argument of map_data should include
the name of the country. Note that the name of the country can be typed
using all lowercase or with the first letter capitalized. We also
introduce coord_fixed to make plots even better looking.
This fixes the relationship between one unit in the y direction and one
unit in the x direction. Then, even if you change the outer dimensions
of the plot (i.e. by changing the window size or the size of the pdf
file you are saving it to for example), the aspect ratio remains
unchanged.
india_map <- map_data("world", "India")
india_map %>% ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "cyan4", color = "white") +
coord_fixed(1.2)
If we want to plot more than one country, we need to supply a string
of country names as the second argument. Here we introduce
theme_set(theme_bw()). As we’ve seen before, the default
background color for ggplot2 is a light gray with white
grid lines. If you would prefer a different color background you can use
different themes. theme_set(theme_bw()) makes the
background color white and the grid lines light gray. This just has to
do with preference.
theme_set(theme_bw())
multiple_countries_map <- map_data("world", c("Japan", "Ethiopia", "Australia", "Germany"))
multiple_countries_map %>% ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "darkred", color = "white") +
coord_fixed(1.3)
If you want to remove the longitude and latitude axes as well as the
grid lines, you can use the theme layer and add the
following code to the plot. Since we set a theme for all plots above
using theme_set(theme_bw()), we need to reset the
background to the ggplot default using
theme_set(theme_grey()) first.
theme_set(theme_grey()) # back to the default background
multiple_countries_map %>% ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "darkred", color = "white") +
theme(panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
coord_fixed(1.3)
We can also add fill = region to the aes()
function to fill each country with a different color and add a
legend.
theme_set(theme_grey()) # back to the default background
multiple_countries_map %>% ggplot(aes(x = long, y = lat, group = group, fill = region)) +
geom_polygon(color = "white") +
theme(panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
coord_fixed(1.3)
There are multiple ways of plotting the US with the maps
package. You could use the world data and specify
usa as the country:
us_map <- map_data("world", "usa")
us_map %>% ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "blue4", color = "white") +
coord_fixed(1.3)
But this is quite difficult to see because it includes Guam, a US
territory (it’s difficult to see but this is the set of white dots on
the right side of the plot). A better alternative (assuming you don’t
want to include territories and just US states) would be to use
world2 data and specify usa:
better_us_map <- map_data("world2", "USA")
better_us_map %>% ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "blue4", color = "white") +
coord_fixed(1.3)
world2 centers the plot on the Pacific Ocean, allowing
us to see Hawaii and bigger versions of Alaska and the mainland states
(also known as the lower 48 or contiguous 48).
If you don’t need to plot Alaska and Hawaii, an even better option is
to not use world or world2 and instead use
usa; the maps package provides data
specifically for the mainland states of the US. If we use
usa instead, we get a much better plot of the US (without
Alaska and Hawaii).
We can also experiment with color a bit and have the outline color
set to purple and the fill color set to NA, which leaves it
blank or transparent to the background color.
usa <- map_data("usa")
usa %>% ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(color = "purple", fill = NA)
usa %>% ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(color = "purple", fill = "black")
If we want to show individual states, we should instead use the
state option. But we see a problem - a massive legend is
made with one color per state. This is a problem because it eats up most
of the plot area. We could make the plot bigger to accommodate the
legend, but it would still be difficult for someone without much
knowledge of the states to figure out which color corresponds to which
state. A better option would be to remove the legend and add labels to
the states themselves.
us_states <- map_data("state")
us_states %>% ggplot(aes(x = long, y = lat, fill = region, group = group)) +
geom_polygon(color = "white") +
coord_fixed(1.3)
To remove the legend we add the layer
guides(fill = FALSE).
us_states %>% ggplot(aes(x = long, y = lat, fill = region, group = group)) +
geom_polygon(color = "white") +
coord_fixed(1.3) +
guides(fill = FALSE) # do this to leave off the color legend
Adding state labels with the maps package is more
complex and not worth our time here. If you need to label states, we
recommend using the usmap package and first reading this
post and then this
post.
Mapping US state counties is also possible using the
maps package by using map_data("county"). Here
we choose a red outline and no fill color, but we notice that the county
lines are a little thick and make some counties difficult to see.
AllCounty <- map_data("county")
AllCounty %>% ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(color = "red", fill = NA)
We can change the width of the lines using the size
argument, making individual counties a bit easier to see.
AllCounty %>% ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(color = "red", fill = NA, size = .1 )
Now that we know the basic setup for plotting different kinds of maps, let’s create a meaningful map. Specifically, let’s create a corrected version of a misleading set of maps.
On July 17th, 2020 Andisheh Nouraee tweeted about the following screenshots on Twitter, calling out Georgia’s Department of Public Health for hiding the severity of the COVID-19 outbreak in Georgia with its map visualizations.
Nouraee had been taking daily screenshots and noticed that the legend
numbers kept changing. With changing legend number cutoffs for each
color, it seems like the number of cases is staying steady, rather than
increasing. As ethical data scientists and members of the Harvard Chan
School Community, we know that plots like this spread misinformation and
can lead to worse health outcomes. In order to correct this mistake and
accurately show the severity of the outbreak, we will be recreating this
plot with a common legend for both dates. We’ll use the
maps package and a bit of data wrangling to achieve the
desired final product.
Let’s first make a plot of Georgia and its counties. We specify the
county data as before, but add georgia to
indicate we only want the counties for the state of Georgia to be
plotted.
GeorgiaCounty <- map_data("county", "georgia")
head(GeorgiaCounty)
## long lat group order region subregion
## 1 -82.44862 31.94813 1 1 georgia appling
## 2 -82.42570 31.94813 1 2 georgia appling
## 3 -82.40852 31.94240 1 3 georgia appling
## 4 -82.39706 31.94240 1 4 georgia appling
## 5 -82.38560 31.93094 1 5 georgia appling
## 6 -82.35122 31.91948 1 6 georgia appling
Then we plot the state and counties without any axes or background color or grid lines. For now the counties are white with black outlines, we’ll change this to depend on the number of cases. We’ll also produce 2 plots of Georgia counties, one for July 2nd and one for July 17th.
GeorgiaCounty %>% ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "white", color = "black") +
theme(panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
coord_fixed(1.3)
We have the data needed to make a plot of counties in Georgia, but we
don’t have any COVID-19 case data yet. Using the
us-counties.csv data file posted by the New York Times on
GitHub, we can read in daily new COVID-19 cases data for all
counties in the US starting in January 2020.
url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
cases <- read_csv(url)
head(cases)
## # A tibble: 6 × 6
## date county state fips cases deaths
## <date> <chr> <chr> <chr> <dbl> <dbl>
## 1 2020-01-21 Snohomish Washington 53061 1 0
## 2 2020-01-22 Snohomish Washington 53061 1 0
## 3 2020-01-23 Snohomish Washington 53061 1 0
## 4 2020-01-24 Cook Illinois 17031 1 0
## 5 2020-01-24 Snohomish Washington 53061 1 0
## 6 2020-01-25 Orange California 06059 1 0
We can see we have more data than we need since we are focusing on Georgia and only two specific dates. Let’s filter out rows we won’t need for this plot and save the relevant rows as a new data frame.
dates <- c(ymd("2020-07-02", "2020-07-17"))
georgia_cases <- cases %>% filter(state == "Georgia", date %in% dates)
dim(georgia_cases)
## [1] 320 6
head(georgia_cases)
## # A tibble: 6 × 6
## date county state fips cases deaths
## <date> <chr> <chr> <chr> <dbl> <dbl>
## 1 2020-07-02 Appling Georgia 13001 266 14
## 2 2020-07-02 Atkinson Georgia 13003 155 2
## 3 2020-07-02 Bacon Georgia 13005 254 4
## 4 2020-07-02 Baker Georgia 13007 43 3
## 5 2020-07-02 Baldwin Georgia 13009 543 34
## 6 2020-07-02 Banks Georgia 13011 140 1
Looking back at the original plots, we find that the number of cases
are presented per 100,000 people. We need to know the population of each
county in Georgia before we can create a case rate per 100,000 people.
But we aren’t provided population counts in the
georgia_cases or GeorgiaCounty data frames -
we need to find another dataset to pull this information from and join
it with what we have.
We can get the population data we need from the Georgia
Department of Public Health website. There isn’t a raw version of
this dataset available so we downloaded it, saved it as
county_pop.csv, and placed it in our course data
visualization folder on GitHub.
Now we can read in the data and see what we have.
pop_data <- read_csv("county_pop.csv")
dim(pop_data)
## [1] 161 12
head(pop_data)
## # A tibble: 6 × 12
## county_name cases county_id `State FIPS code` `County FIPS code` population
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Appling 1054 US-13001 13 1 18561
## 2 Atkinson 447 US-13003 13 3 8330
## 3 Bacon 602 US-13005 13 5 11404
## 4 Baker 84 US-13007 13 7 3116
## 5 Baldwin 2132 US-13009 13 9 44428
## 6 Banks 501 US-13011 13 11 19982
## # … with 6 more variables: hospitalization <dbl>, deaths <dbl>,
## # `case rate` <dbl>, `death rate` <dbl>, `14 day case rate` <dbl>,
## # `14 day cases` <dbl>
The state counties are listed under county_name and
population under population. These are the only 2 columns
we need for joining population data to our cases data frame. First we
only keep the county_name and population
columns to make the join simpler and to make sure we don’t create any
duplicate columns when joining to the cases data frame. Next we use
left_join to add the population for each county to the
cases data frame. Note that the column names for county are different in
each dataset - in the cases data frame the column name is
county while it is county_name in the
population data frame. We could rename county_name to
county, or we can simply let the join function know that
the data frames have different names for the same thing using
by = c("county" = "county_name").
pop_data <- pop_data %>% select(county_name, population)
data_full <- left_join(georgia_cases, pop_data, by = c("county" = "county_name"))
dim(data_full)
## [1] 320 7
head(data_full)
## # A tibble: 6 × 7
## date county state fips cases deaths population
## <date> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2020-07-02 Appling Georgia 13001 266 14 18561
## 2 2020-07-02 Atkinson Georgia 13003 155 2 8330
## 3 2020-07-02 Bacon Georgia 13005 254 4 11404
## 4 2020-07-02 Baker Georgia 13007 43 3 3116
## 5 2020-07-02 Baldwin Georgia 13009 543 34 44428
## 6 2020-07-02 Banks Georgia 13011 140 1 19982
Now that the population column has been added to the
cases data frame and a new combined data frame,
data_full, has been created, we can create a cases per
100,000 population variable called rate.
data_full <- data_full %>% mutate(rate = 100000*(cases/population))
head(data_full)
## # A tibble: 6 × 8
## date county state fips cases deaths population rate
## <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2020-07-02 Appling Georgia 13001 266 14 18561 1433.
## 2 2020-07-02 Atkinson Georgia 13003 155 2 8330 1861.
## 3 2020-07-02 Bacon Georgia 13005 254 4 11404 2227.
## 4 2020-07-02 Baker Georgia 13007 43 3 3116 1380.
## 5 2020-07-02 Baldwin Georgia 13009 543 34 44428 1222.
## 6 2020-07-02 Banks Georgia 13011 140 1 19982 701.
We have one more join to do before we can create the plot. The
data_full data frame needs to be joined to the
GeorgiaCounty data frame we created earlier that contains
the mapping data for the counties. Before we can join, look at the
county column in data_full and
subregion column in GeorgiaCounty. These are
both referring to the counties in Georgia, but the county
entries are capitalized while subregion entries are not. If
we want to join these data frames, the quickest way will be to make the
county column entries lowercase using
str_to_lower and then joining.
georgia_map <- data_full %>% mutate(county = str_to_lower(county)) %>%
left_join(GeorgiaCounty, by = c("county" = "subregion"))
dim(georgia_map)
## [1] 10192 13
head(georgia_map)
## # A tibble: 6 × 13
## date county state fips cases deaths population rate long lat group
## <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 2 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 3 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 4 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 5 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 6 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## # … with 2 more variables: order <int>, region <chr>
Now we are ready to start plotting. Before we specify the cutoffs and
colors we want, let’s create the default version of the plot with
counties filled by case rate and a subplot for each date. Note that we
need to remove the fill = "white" code we used earlier.
georgia_map %>% ggplot(aes(x = long, y = lat, group = group, fill = rate)) +
geom_polygon(color = "black") +
theme(panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
coord_fixed(1.3) +
facet_grid( ~ date)
One thing we notice is that one county is white - it’s missing rate
data. On closer inspection of the GeorgiaCounty and
georgia_cases data frames, we notice that DeKalb county is
saved as “de kalb” in the GeorgiaCounty data frame, but as
“DeKalb” in the in the georgia_cases data frame. That extra
space means that even if we make all letters lowercase,
de kalb and dekalb won’t match and when we
join the data frames, DeKalb county will be missing rate data. We need
to remove the space in the GeorgiaCounty data frame and
then join the data frames again.
GeorgiaCounty <- GeorgiaCounty %>%
mutate(subregion = str_replace(subregion, " ", ""))
georgia_map <- data_full %>% mutate(county = str_to_lower(county)) %>%
left_join(GeorgiaCounty, by = c("county" = "subregion"))
head(georgia_map)
## # A tibble: 6 × 13
## date county state fips cases deaths population rate long lat group
## <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 2 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 3 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 4 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 5 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 6 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## # … with 2 more variables: order <int>, region <chr>
georgia_map %>% ggplot(aes(x = long, y = lat, group = group, fill = rate)) +
geom_polygon(color = "black") +
theme(panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
coord_fixed(1.3) +
facet_grid( ~ date)
Uh oh. It’s happened again but with 2 different counties. The names
of the counties are not matching up again. Looking at a map of Georgia
counties we can match the white counties to Ben Hill and Jeff Davis.
Both of these had spaces in the GeorgiaCounty and
georgia_map data (and thus matched), but we removed the
spaces and now they don’t match. We have 2 options: remove all spaces in
both data frames, or go back and only remove the space in DeKalb. Since
we don’t need the names of the counties in this map, let’s remove all
spaces from both data frames.
data_full <- data_full %>%
mutate(county = str_replace(county, " ", ""))
georgia_map <- data_full %>% mutate(county = str_to_lower(county)) %>%
left_join(GeorgiaCounty, by = c("county" = "subregion"))
georgia_map %>% ggplot(aes(x = long, y = lat, group = group, fill = rate)) +
geom_polygon(color = "black") +
theme(panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
coord_fixed(1.3) +
facet_grid( ~ date)
This looks good and can see that there is a common legend for both plots - exactly what we want! Now we just need to change the colors a bit and customize the legend cutoff values, labels and title.
We’ll use the base R cut function to specify case rate
ranges and create a new column in our georgia_map data
frame named manual_fill to use in our plot. The first
argument of the cut function specifies what variable or
column you want to split into ranges. The cut points are then specified
with the breaks argument, and the labels for the resulting
intervals are specified with the labels argument. Finally,
we need to add right = TRUE to indicate the intervals
should be closed on the right and open on the left. This means, for
example, an interval for 1-10 is represented as (1, 10] where 10 is
included but 1 is not.
georgia_map <- georgia_map %>%
mutate(manual_fill = cut(rate, breaks = c(0, 620, 1070, 1622, 2960, Inf),
labels = c("1-620", "621-1,070", "1,071-1,622",
"1,623-2,960", ">2,960"),
right = TRUE))
head(georgia_map)
## # A tibble: 6 × 14
## date county state fips cases deaths population rate long lat group
## <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 2 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 3 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 4 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 5 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## 6 2020-07-02 appling Geor… 13001 266 14 18561 1433. -82.4 31.9 1
## # … with 3 more variables: order <int>, region <chr>, manual_fill <fct>
One final step before plotting is specifying the colors we want. The colors chosen from this resource are as close as possible to the colors used in the original plots.
# Specify desired colors
pal <- c("lightskyblue2", "steelblue2", "dodgerblue3", "dodgerblue4", "red")
Using manual_fill to fill the counties and a
scale_fill_manual layer specifying the title and labels of
the legend, we get the desired result and can now see that the cases
have actually increased between July 2nd and July 17th.
date.labs <- c("July 2, 2020", "July 17, 2020")
names(date.labs) <- c("2020-07-02", "2020-07-17")
georgia_map %>% ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = manual_fill), color = "black") +
scale_fill_manual(name = "Cases per 100,000", values = pal) +
coord_fixed(1.3) +
theme(panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
ggtitle("COVID-19 Cases per 100K") +
facet_grid(. ~ date, labeller = labeller(date = date.labs))
And we can see the difference between our corrected version and the misleading original.
If we want to know which counties are in red on July 17th, we can add
their names as labels to our graph. Note that because we are using
facet_grid the county names will appear on both
subplots.
georgia_map_labels <- georgia_map %>%
filter(rate > 2960) %>%
group_by(county) %>%
summarize(lat = mean(lat),
long = mean(long))
library(ggrepel)
georgia_map %>% ggplot(aes(x = long, y = lat)) +
geom_polygon(aes(fill = manual_fill, group = group), color = "black") +
scale_fill_manual(name = "Cases per 100,000", values = pal) +
coord_fixed(1.3) +
geom_text_repel(data = georgia_map_labels, aes(long, lat, label = county), color = "orange") +
theme(panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
ggtitle("COVID-19 Cases per 100K") +
facet_grid(. ~ date, labeller = labeller(date = date.labs))
If you don’t want this to happen and only want to label the red
counties for each date, you would plot each date separately and then
combine the plots using the ggarrange function from the
egg package. The ggarrange function is similar
to the grid.arrange function, but ggarrange
has the added benefit of making the subplots the same size. If we were
to use grid.arrange, the July 2nd plot would be much bigger
than the July 17th plot because the July 2nd plot does not include a
legend.
july_2_labels <- georgia_map %>%
filter(date == "2020-07-02", rate > 2960) %>%
group_by(county) %>%
summarize(lat = mean(lat),
long = mean(long))
july_17_labels <- georgia_map %>%
filter(date == "2020-07-17", rate > 2960) %>%
group_by(county) %>%
summarize(lat = mean(lat),
long = mean(long))
p1 <- georgia_map %>% filter(date == "2020-07-02") %>%
ggplot(aes(x = long, y = lat)) +
geom_polygon(aes(fill = manual_fill, group = group), color = "black") +
scale_fill_manual(values = pal) +
coord_fixed(1.3) +
geom_text_repel(data = july_2_labels, aes(long, lat, label = county), color = "orange") +
theme(legend.position = "none",
panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
ggtitle("July 2, 2020")
p2 <- georgia_map %>% filter(date == "2020-07-17") %>%
ggplot(aes(x = long, y = lat)) +
geom_polygon(aes(fill = manual_fill, group = group), color = "black") +
scale_fill_manual(name = "Cases per 100,000", values = pal) +
coord_fixed(1.3) +
geom_text_repel(data = july_17_labels, aes(long, lat, label = county), color = "orange") +
theme(panel.grid.major = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
ggtitle("July 17, 2020")
library(egg)
ggarrange(p1, p2, nrow = 1)